Apparatus for variable word length computing in an array processor

ABSTRACT

A computational unit comprises a processor having a plurality of processing elements, each having an arithmetic logic unit, and a controller for controlling the processor elements. The processor can provide a respective bit of a multiple bit word to each of the processor elements and enables signals to be transmitted between the arithmetic logic units to enable the units to perform a parallel operation on the bits of the multiple bit word. Extension circuitry is provided for selectively coupling one or more computational units together to combine their parallel processing capability.

FIELD OF THE INVENTION

[0001] The invention generally relates to a processing apparatus and more particularly relates to a processing apparatus that contains a number of processing units capable of operating in parallel.

BACKGROUND

[0002] Designing a modern microprocessor is a complex task that demands careful balance between cycle times, instruction set architecture, instruction latency, otherwise known as cycle-per-instruction, and finally die area costs. Many traditional microprocessors are designed to execute a single instruction at a time. The processor executes instructions in a serial fashion. This paradigm generally implies a single processing core. The performance of such microprocessors has been improved by two basic approaches. The first of these is the data path width. Over time the data path width has increased from the conceptual 1-bit Turing Machine, to some of the latest 128-bit processors. Second, the performance of the processor has also been improved by increasing the rate at which instructions are executed, i.e. the clock frequency has been increased. This increase has taken the logic from 33 MHz to 1.4 GHz in approximately ten years. While the above developments have provided considerable increases in the performance of “serial” microprocessors, there are tasks to which they are not well suited.

[0003] One task to which serial processors are not well suited is the manipulation of multi-media data. Multi-media data is an example of so-called parallel data. Parallel data is data where the individual data are independent of one another. Such data can be processed in parallel as the manipulation of one datum does not require results from the manipulation of other data. It is also the case that multi-media data generally only requires simple manipulation. This implies that the complexity of the processor can be reduced with the absolute number of processors increasing to process the data in parallel. This has resulted in an evolution towards word-parallel computing, which offers a better balance between cycle time and instruction latency.

[0004] One of the major shortcomings with such an approach is the fact that for certain types of processing, namely, multimedia applications, very wide data paths are very often unutilized. To design around this, SIMD extensions were introduced, which divide the existing data path into a number of narrower data paths, such that an instruction can be executed on a number of data samples concurrently. One widely known example of this is the MMX processing unit in Intel's Pentium processor, which is applied to a 64-bit data path.

[0005] Single Instruction Multiple Data (SIMD) processing is a processing paradigm that is suited to the processing of parallel multi-media data. The concept of SIMD processing architectures has been known for some time. However, historically these processing architectures have encountered problems including high power consumption. This high power consumption is mainly a result of the line resistance associated with the large number of interconnects associated with an array of SIMD processors. The resistance associated with this power consumption also reduces the speed of communications.

[0006] Interconnect resistance can also be incurred through the arrangement of SIMD processors and memory. As SIMD processing is generally performed with an array of processors, there can be differences in the interconnect length between the processors and the memory. One method of mitigating the above issues has been the tight integration of processors and memory as outlined in U.S. Pat. No. 5,956,274 issued on 21st Sep. 1999 to Duncan Elliott, et al. ('274 patent).

[0007] The '274 patent generally teaches the placement of processors directly adjacent to a memory array and more particularly teaches a configuration that reduces the memory column to processor ratio. The arrangement taught in the above patent greatly reduces resistance and delay issues. Further, the arrangement taught in the '274 patent reduces timing problems that are a result of uneven interconnect line length.

[0008] SIMD processing architectures are often implemented with 1-bit processors which can result in bit serial processing. This is the case of the '274 patent. However, bit serial processing introduces its own processing problems including: realignment of the data for bit-serial processing (commonly referred to as corner-turning), non-uniform cycle execution time, and increased instruction latency. Another approach for converging on an optimal architecture is limiting the scope of applications, which reduces the flexibility of such a processor.

[0009] A desired architecture of SIMD processing includes a balance between flexibility and efficiency. Processing units with a variable, dynamically re-configurable, data-path width would allow for greatly improved flexibility at minimal impact on the efficiency. It would be advantageous to be able to adjust the width of the data path of such a processing element to the width of the data word required by a given application, while maintaining word-parallel instruction execution.

[0010] It is also often the case in SIMD processing that a word whose length is greater than the bit width of the processor must be aligned such that processing occurs in a serial manner with the word now being processed serially through a processing element. This requirement once again forces the processing to be limited by the throughput of the processing element.

[0011] Therefore there is a need for a means for creating SIMD based processing units whose data path width can be varied to match the word length of the data word to be processed.

SUMMARY OF INVENTION

[0012] According to one aspect of the present invention, there is provided a circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of the processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable the units to perform a parallel operation on the bits of the multiple bit word.

[0013] Advantageously, the processor provides an arrangement of processor elements which can operate together in parallel to process multiple bit data.

[0014] In one embodiment, each arithmetic logic unit (ALU) has an output and an input, and said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit.

[0015] In another embodiment, the transmission means includes means for coupling the output of the ALU for processing the most significant bit (MSB) of the multiple bit word directly to the input of the ALU for processing the least significant bit (LSB) of the multiple bit word.

[0016] In another embodiment, the processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element.

[0017] Another aspect of the invention provides a circuit and architecture of processing elements such that the effective arrangement of processing elements can be dynamically altered such that the data path width matches the word length of the data word to be processed.

[0018] According to another aspect of the present invention, there is provided a circuit comprising a plurality of computational units, each having at least one processor element and extension circuitry for switchably enabling the transmission of signals between the computational units to enable the units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit.

[0019] Advantageously, this arrangement enables any number of computational units, each of which is able to process at least one bit of data at a time, to be coupled together to parallel process a multiple bit word, for example whose length is greater than the word that can be parallel processed by an individual computational unit.

[0020] In one embodiment, the circuit comprises a first computational unit (CU) and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension circuitry is arranged to enable an MSB output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU.

BRIEF DESCRIPTION OF FIGURES

[0021] FIG. 1 shows a schematic diagram of a data processor according to an embodiment of the present invention;

[0022] FIG. 2a is a block diagram illustrating computational units and connections therebetween according to one embodiment of the invention;

[0023] FIG. 2b is a schematic diagram of another embodiment of the invention;

[0024] FIG. 3 shows a schematic diagram of a data processor including extension circuitry, according to one embodiment of the invention;

[0025] FIG. 4 is a schematic diagram of a data processor according to an embodiment of the invention; and

[0026] FIG. 5 shows a schematic diagram of an array of computational units and various schemes for interconnecting the units, according to an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

[0027] FIG. 1 shows a data processor according to an embodiment of the present invention. The data processor 1 comprises a computational unit 3 having a plurality of SIMD based processing elements 5, 7, 9, 11, each of which has access to a memory 13. Each processor element 5, 7, 9, 11 has an arithmetic logic unit (ALU) 15, 17, 19, 21 each having an input port 23 and an output port 25. The input port 23 of each ALU is directly coupled to the output port of the neighboring ALU to its right to enable data (e.g. a bit) to propagate from the output 25 to the input 23 of adjacent ALUs. Each processor element further comprises one or more registers 27 and a multiplexer 29 for switchably coupling one or more of a plurality of inputs to a register. In this embodiment, inputs to the multiplexer include the output from its local ALU, the outputs from its neighboring ALUs to its left and right, and an output from the memory 13.
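By way of illustration only, the neighbour-to-neighbour coupling of the ALUs can be modelled in software. The sketch below assumes that the bit propagated from each ALU output to the input of the ALU on its left is the carry of an addition; that use, the function name ripple_add and the MSB-first list layout are assumptions made for this example and are not stated in the description.

```python
def ripple_add(a_bits, b_bits):
    """Add two equal-length words given MSB-first as lists of 0/1.

    Hypothetical model: index 0 is the left most (MSB) element and the
    last index is the right most (LSB) element. Each position stands in
    for one single-bit ALU; the carry plays the role of the bit passed
    from an ALU output to the input of its left-hand neighbour.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same number of bits")
    carry = 0
    result = [0] * len(a_bits)
    # Walk from the LSB position (right) towards the MSB position (left).
    for i in range(len(a_bits) - 1, -1, -1):
        total = a_bits[i] + b_bits[i] + carry
        result[i] = total & 1   # sum bit kept by this element
        carry = total >> 1      # bit propagated to the neighbour on the left
    return result, carry


if __name__ == "__main__":
    # 5 + 3 on a four-element unit
    print(ripple_add([0, 1, 0, 1], [0, 0, 1, 1]))
```

The final carry returned by the function corresponds to the bit leaving the left most element, which in the described apparatus would be available to the control circuit.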

[0028] The data processor 1 has a control circuit 31 (which is also referred to herein as a boundary circuit) having a first input port 33 which is coupled to the output port 25 of the ALU 15 of the first (i.e. left most) processor element 5, a first output port 35 coupled to the input port 23 of the ALU 21 of the last (i.e. right most) processor element 11, a second input port 37 coupled to the output port 25 of the ALU 21 of the right most processor element 11, and a second output port 39 coupled to one of the input ports of the multiplexer 29 of the left most processor element 5.

[0029] In this embodiment, the control circuit 31 further includes a third input port 41 for receiving external data, for example a set/reset bit (SRB). The control circuit also has a third output port 43 for outputting data which is broadcast to all of the processor elements 5, 7, 9, 11. The broadcast data may for example comprise a set/reset bit, for example received at the third input port 41, a bit output from the first ALU 15, received at the first input port 33, or a bit output from the last ALU 21 and received at the second input port 37.

[0030] The control circuit 31 controls the transmission of data to one or more processor elements within the processor to enable the processor to operate on multiple bit data (or words). For example, the control circuit 31 may be arranged to enable a barrel shift, in which data is shifted from one processor element to its adjacent processor element, either to the right or to the left, and data from the end most processor element towards which the shift is directed is fed by the control circuit 31 to the processor element at the opposite end. Thus, for a left barrel shift, the control circuit 31 is configured to output the data received at the first input port 33 from the ALU of the left most processor element 5 to the input 23 of the ALU 21 of the right most processor element 11. Conversely, for a right barrel shift, the control circuit 31 is configured to pass the data received at its second input port 37 from the output 25 of the right most ALU 21 to the left most processor element 5. As mentioned above, the control circuit 31 may also broadcast data which is common to two or more processor elements to the appropriate processor elements, so that for example the data is transmitted to each processor element simultaneously.
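The barrel shift described above may be illustrated by the following minimal software sketch. The function names and the list representation (index 0 as the left most element, last index as the right most element) are hypothetical and chosen only for the example; the wrap-around assignment stands in for the path through the control circuit.

```python
def barrel_shift_left(bits):
    """Shift every bit one element to the left.

    The bit leaving the left most element (index 0) is fed back, as the
    control circuit would do, into the right most element.
    """
    if not bits:
        return []
    wrapped = bits[0]              # bit seen at the first input port
    return bits[1:] + [wrapped]    # re-enters at the right most element


def barrel_shift_right(bits):
    """Shift every bit one element to the right.

    The bit leaving the right most element is fed back into the
    left most element.
    """
    if not bits:
        return []
    wrapped = bits[-1]             # bit seen at the second input port
    return [wrapped] + bits[:-1]   # re-enters at the left most element


if __name__ == "__main__":
    word = [1, 0, 0, 1]
    print(barrel_shift_left(word))
    print(barrel_shift_right(word))
```

In this model no bit is lost by either shift; applying a left shift followed by a right shift returns the original word.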

[0031] Thus, the data processor 1 is configurable for processing multiple bit data. Each processor element may comprise a single bit processor, and the data processor may contain any number of one-bit processor elements, for example 2, 4, 8, 16, 32 or any other number.

[0032] In operation, the data processor may be arranged such that each of the processor elements receives a single bit of a multiple bit word, e.g. from the memory 13, or from any other source, such as a different memory or device, so that, for example, the left most processor element 5 receives the most significant bit (MSB) and the right most processor element receives the least significant bit (LSB) of the word to be processed. If read from the memory 13, the multiple bit data may be stored in the memory in such a way that the processor elements receive the data bits in parallel. For example, the memory may contain a plurality of memory segments, each having a read access port 8, which is coupleable to a respective processor element. Each data bit may be stored in a different segment and thereby read out of memory in parallel into the processor elements. After the data has been read into the processor elements, the processor elements are controlled to process the data as a multiple bit word (i.e. a word having an MSB and an LSB). In one embodiment, the data processor may be incorporated in a SIMD processor, and each processor element may be controlled in parallel by an array controller 45. The processor elements may be adapted to be capable of performing multiple step operations and able to store the intermediate results of these operations, thereby reducing the frequency of memory accesses. After a process has been completed, for example on one or more data words, the result of the process (e.g. a multi-bit word) may be written to memory, e.g. the local memory associated with each processor element, another portion of memory or output to another device. Write operations by the processor elements may be controlled in parallel, so that the word bits are output as a unitary word from the processor. Write operations from each processor element may be coordinated and controlled by the array controller 45.
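As a purely illustrative model of the distribution of a word across the processor elements, the sketch below splits an integer into one bit per element, MSB at the left most position, and reassembles it for a unitary write. The helper names word_to_bits and bits_to_word are hypothetical and do not correspond to any element of the figures.

```python
def word_to_bits(word, n_elements):
    """Return the bits of `word`, MSB first, one per processor element."""
    if word < 0 or word >= (1 << n_elements):
        raise ValueError("word does not fit in the available elements")
    return [(word >> (n_elements - 1 - i)) & 1 for i in range(n_elements)]


def bits_to_word(bits):
    """Reassemble the per-element bits (MSB first) into a single word."""
    word = 0
    for bit in bits:
        word = (word << 1) | (bit & 1)
    return word


if __name__ == "__main__":
    bits = word_to_bits(0b1011, 4)   # left most element receives the MSB
    print(bits, bits_to_word(bits))
```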

[0033] In contrast, in the SIMD processor architecture disclosed in U.S. Pat. No. 5,956,274 (Duncan Elliott), in which a processor element is provided under each column of memory, multiple bit data can only be processed serially by a single processor element, and therefore the data must be read from the memory in series, and the processing element can only process one bit of the serial data at a time. Thus, for write operations to memory, Elliott's processor either requires additional circuitry, such as a 2D array of registers to enable data to be turned before being written to memory, or requires the rotation of data into a single column to be performed by a number of processor elements equal to the number of bits in the data.

[0034] Returning to the present embodiment, the control circuit 31 has a write control input port 47 for receiving a write control signal, e.g. a write enable (IWE) signal from the array controller 45. The control circuit 31 may control write operations in response to both the write control signal from the array controller and another state associated with the data processor, for example a state recorded in the control circuit 31. The control circuit 31 has a write control output port 49 for outputting a write enable signal, which in this embodiment may be passed or broadcast to each of the write enable lines 12 (which may be coupled together by a line 51) associated with each I/O memory port 8 for enabling a bit of a multiple bit word to be output by a respective processor element to its respective I/O port for storage in the memory 13. In another embodiment, circuitry may be provided for directing data output by the processor elements to another part of memory, or to another device, and circuitry may enable the data to be selectively directed to one of the local memory and another destination, as disclosed in the applicant's copending applications, Attorney docket Nos. 79135-4 and 79135-5 filed on 4th Mar. 2002, the disclosures of which are incorporated herein by reference.

[0035] The data processor may be adapted such that the processor elements can be reconfigured from operating together as a multiple bit word processor to operating individually or separately as independent elements. In this embodiment, the boundary circuit, which passes signals to the processor elements required for the PEs to operate together on multiple bit words, would be conditioned or configured to enable the PEs to operate independently. To enable independent write operations from each PE, the write control circuitry, which controls multiple bit word write operations (i.e. when the processor elements are operating in parallel for multiple bit word processing), may be adapted to selectively enable each PE to perform independent write operations. Thus, instead of the write operations of all PEs being controlled by the same write enable signal, each PE would be controlled by a separate write control signal. In one embodiment, the data processor may be dynamically reconfigurable between a multiple independent PE processor and a multi-bit word parallel processor, so that, for example, the operation of the processor can be switched between these two operating modes between successive processes.

[0036] Embodiments of another aspect of the present invention provide a system for grouping SIMD based processing elements into a processing unit whose bit width can be varied to match that of the word being computed. As such it allows for the exchange of signals required for coordination and proper operation of the processing elements that are elements of the processing unit.

[0037] FIG. 2a is a schematic block diagram of an embodiment of the invention. A data processor 101 comprises a plurality of computational units 103, 105, each of which performs processing functions and contains at least one processing element (which may or may not be a SIMD based processing element). Each computational unit 103, 105 has a bit width equal to the number of processing elements times the bit width of the processing elements. A boundary circuit 107, 109 is connected to and associated with each computational unit 103, 105. An extension circuit 111 is located between and connected to the two computational units 103, 105 and the two boundary circuits 107, 109 associated with the computational units 103, 105. The extension circuit 111 is used to combine computational units to widen the effective data path. For example, the extension circuit allows two N-bit computational units 103, 105 to be combined such that a 2N bit wide processing unit is formed. Each boundary circuit 107, 109 provides for the distribution of signals to its associated computational unit 103, 105, as for example described above in connection with the embodiment shown in FIG. 1. The basic repeating unit of circuits is presented as a first grouping 113 and the minimum grouping required for the formation of a 2N bit wide computational unit is shown as a second grouping 115.

[0038] Another embodiment of the invention is presented in FIG. 2b. In this embodiment a memory such as a Random Access Memory (RAM) 117 is directly connected to the computational units 103, 105. The memory 117 is further connected to and is accessible through a bus 119. In this embodiment the computational units 103, 105 communicate directly with the memory 117 without having to use the bus 119.

[0039] A data processor according to another embodiment of the invention is illustrated in FIG. 3. The data processor 201 has first and second computational units (CU) 203, 205, each comprising a plurality of single bit processor elements 207, 209, 211, 213. The number of PEs in each computational unit may be selected depending on the application. For example, in one embodiment, each computational unit may contain 8 PEs so that each unit can parallel process 8 bit data. The number of computational units may also depend on the application. For example, if the processor is required to parallel process both 8 bit and 16 bit data, a minimum of two computational units would be required. (However, an 8-bit computational unit may be configured for processing 16-bit data, by processing one byte of the 16-bit data at a time).

[0040] Each processor element has an arithmetic logic unit (ALU) 215, 217, 219, 221 having an input port 223 and an output port 225, and one or more registers 227, which may provide data to one or more other inputs of the ALU. In this embodiment, the PEs in each CU are arranged in a one dimensional array and are arranged in an order corresponding to the bit order in a multiple bit word, so that the left most PE is in the MSB position and the right most PE is in the LSB position. The processor elements in the embodiment may be similar to and have any of the features of the processor elements of the embodiment shown in FIG. 1.

[0041] Each computational unit 203, 205 has an associated boundary circuit 229, 231 for controlling the transmission of data to the processor elements required for operation of the computational unit as a parallel processor. As for the embodiment described above and shown in FIG. 1, the outputs of both the MSB and LSB processor elements of each computational unit 203, 205 are coupleable to its respective boundary circuit 229, 231. Each boundary circuit is also coupleable to output data to the input 223 of its associated LSB PE 209, 213 and to output data to its associated MSB PE 207, 211. The boundary circuit can also broadcast data to a plurality of processor elements, e.g. via the O/P port 212, and may receive external data via an I/P port 214, for example from the array controller (not shown).

[0042] An extension circuit 233 is provided to switchably couple the first and second computational units 203, 205 and their associated boundary circuits together to combine their individual word length parallel processing capacity, for example from an individual capacity of 8 bits to a combined capacity of 16 bits. The extension circuitry comprises first, second, third and fourth selector switches 251, 253, 255, 257 (which may comprise multiplexers or any other suitable switch), each having first and second input ports 259, 261, an output port 263 and a control input port 204. The first input port 259 of the first selector switch 251 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205, the second input port 261 is coupled to an output port 266 of the first boundary circuit 229, and the output port 263 of the first selector switch is coupled to the input port 223 of the LSB ALU 217 of the first CU 203, and in this embodiment is coupleable to an input of the LSB PE 209 of the first CU 203.

[0043] The first input port 259 of the second selector switch 253 is coupled to the output port 225 of the LSB ALU 217 of the first CU 203, the second input port 261 is coupled to an output 268 of the second boundary circuit 231, and the output port 263 of the second selector switch is coupled to an input 270 of the MSB processor element 211 of the second CU 205.

[0044] The first input port 259 of the third selector switch 255 is coupled to the output port 225 of the LSB ALU 217 of the first CU 203, the second input port 261 is coupled to the output port 225 of the LSB ALU 221 of the second CU 205, and the output port 263 is coupled to an input 272 of the first boundary circuit 229.

[0045] The first input port 259 of the fourth selector switch 257 is coupled to the output port 225 of the MSB ALU 219 of the second CU 205, the second input port 261 is coupled to the output 225 of the MSB ALU 215 of the first CU 203, and the output port 263 is coupled to an input 274 of the second boundary circuit 231.

[0046] A control signal input 276 is provided for receiving control signals for controlling the selector switches 251, 253, 255, 257, from a controller, such as an array controller for controlling the computational units.

[0047] The extension circuit 233 has a first operating mode or state, in which the first and second CUs are decoupled and have their individual parallel processing capability, and a second operating mode or state, which couples the CUs together to combine their parallel processing capability, i.e. for parallel processing a word having a length which is the sum of the lengths of the words that they can parallel process individually.

[0048] In the first (i.e. decoupled) mode, the first selector switch 251 couples the output 266 of the first boundary circuit 229 to the input of the LSB ALU 217 of the first CU 203, the second selector switch 253 couples the output 268 of the second boundary circuit 231 to the input 270 of the MSB PE 211 of the second CU 205, the third selector switch 255 couples the output 225 of the LSB ALU 217 of the first CU 203 to an input 272 of the first boundary circuit 229, and the fourth selector switch 257 couples the output of the MSB ALU 219 of the second CU 205 to an input 274 of the second boundary circuit 231.

[0049] In the second, coupled mode, the first selector switch 251 couples the output 225 of the MSB ALU 219 of the second CU 205 to the input 223 of the LSB ALU 217 of the first CU 203, the second selector switch 253 couples the output port 225 of the LSB ALU 217 of the first CU 203 to an input 270 of the MSB processor element 211 of the second CU 205, the third selector switch 255 couples the output 225 of the LSB ALU 221 of the second computational unit to an input 272 of the first boundary circuit 229, and the fourth selector switch 257 couples the output 225 of the MSB ALU 215 of the first computational unit 203 to the input 274 of the second boundary circuit 231.
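The routing performed by the four selector switches in the two modes may be summarised by the following illustrative sketch, in which each switch is modelled as a two-input multiplexer controlled by a single flag. The function name, the parameter names and the dictionary keys are hypothetical labels introduced for the example; a single shared control flag is likewise an assumption.

```python
def extension_routes(coupled, msb_alu_cu1, lsb_alu_cu1,
                     msb_alu_cu2, lsb_alu_cu2, bc1_out, bc2_out):
    """Return the signal selected by each of the four selector switches.

    `coupled` is False for the first (decoupled) mode and True for the
    second (coupled) mode. The remaining arguments are the signals
    present on the ALU outputs of the first and second CU and on the
    outputs of the first and second boundary circuits.
    """
    return {
        # first switch: drives the input of the LSB ALU of the first CU
        "lsb_alu_cu1_in": msb_alu_cu2 if coupled else bc1_out,
        # second switch: drives the input of the MSB PE of the second CU
        "msb_pe_cu2_in": lsb_alu_cu1 if coupled else bc2_out,
        # third switch: drives an input of the first boundary circuit
        "bc1_in": lsb_alu_cu2 if coupled else lsb_alu_cu1,
        # fourth switch: drives an input of the second boundary circuit
        "bc2_in": msb_alu_cu1 if coupled else msb_alu_cu2,
    }


if __name__ == "__main__":
    # String labels are used instead of bit values to make the routing visible.
    names = ("msb1", "lsb1", "msb2", "lsb2", "bc1", "bc2")
    print(extension_routes(False, *names))
    print(extension_routes(True, *names))
```

With string labels as inputs, the returned dictionary shows directly which source feeds each destination in either mode.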

[0050] Thus, in the coupled mode, the extension circuit provides the required connections to integrate the two arrays of processor elements of the first and second CUs into a parallel processor having the combined number of PEs. The MSB processor element 207 of the first CU 203 functions as the MSB PE of the extended processor, and the LSB processor element 213 of the second CU 205 becomes the LSB PE of the extended processor. The LSB processor element 209 of the first CU and MSB processor element 211 of the second CU become adjacent intermediate PEs in the extended contiguous array of processor elements.

[0051] In coupled mode, a bit propagate bus is formed, via the first selector switch, between the output of the first ALU of the second CU and the input of the last ALU of the first CU to complete a continuous propagate chain through the series of ALUs of the extended processor, and the output of the first ALU of the first CU is coupled, via the fourth selector switch, to an input of the second boundary circuit. These two connections enable, for example, a left barrel shift in the extended processor, the bit from the MSB ALU 215 of the first CU being transmitted to the input of the LSB ALU of the second CU via the bus 281, the fourth selector switch 257 and the second boundary circuit.

[0052] In coupled mode, a connection is formed between the output 225 of the last ALU 217 of the first CU and the input 270 of the first PE of the second CU, via the second selector switch 253, to permit bit propagation therebetween, and a bus 283 is formed between the output 225 of the last ALU 221 of the second CU and an input 272 to the first boundary circuit 229, via the third selector switch 255. These connections provide the required connections, for example, for a right barrel shift, the bit output from the LSB ALU of the extended processor being transmitted to an input 278 of the MSB processor element via the bus 283, the third switch 255 and the first boundary circuit 229.

[0053] FIG. 4 shows a schematic diagram of a boundary circuit according to an embodiment of the present invention in more detail. Referring to FIG. 4, two boundary circuits 307, 309 are shown, together with their respective computational units 303, 305, and an extension circuit 311 for coupling the computational units and the boundary circuits together to form an extended parallel processor. The figure also shows part of two further extension circuits 315, 317, one to the left of the first CU and one to the right of the second CU, illustrating that the CUs may be part of an extended array of any number of computational units.

[0054] Each boundary circuit 307, 309 has first, second and third multiplexers 319, 321, 323, an AND gate 325 and a plurality of registers 327, 329, 331, 333, and 335.

[0055] Each boundary circuit is connected to an MSB bus 337, for carrying MSB signals either from the MSB ALU of its associated CU, if the CU is operating independently (or it functions as the left most CU of a coupled CU system and therefore carries the MSB of the extended processor), or the MSB bus carries the MSB from the output of the MSB ALU in a composite CU system. Similarly, each boundary circuit is connected to an LSB bus 339, for carrying LSB signals either from the LSB ALU of its associated CU, if the CU is operating independently (or it functions as the right most CU of a coupled CU system), or the LSB bus carries the LSB from the output of the LSB ALU in a composite CU system.

[0056] Each boundary circuit is connected to a common General Purpose Input (GPI) bit line 341, which carries SRB signals from an array controller for controlling operations of the CU's. Each CU is also connected to a common write control bus 343 for carrying write enable signals from the array controller, for controlling write operations from the CU's.

[0057] Each boundary circuit has a bus 345 connected between the output of the MSB ALU and the input of the LSB ALU, via the first selector switch 351 of the extension circuit, for carrying a signal designated IAO. This bus 345 is connected to the output of the first multiplexer 319, whose inputs are connected to the MSB, LSB and GPI buses, and an output of each of the five registers. The inputs of the five registers are also coupled for receiving MSB, LSB and GPI signals from MSB, LSB and SRB buses via the third multiplexer 323. Thus, the IAO signal can be any of the MSB or LSB of the local CU, the MSB or LSB of a composite CU system, or an SRB signal.

[0058] The second multiplexer 321 is coupled for receiving the GPI signal and signals from the third, fourth and fifth registers (which can latch the SRB signal, the local or composite system MSB and LSB signals), and can broadcast any of these signals (which may be referred to as a broadcast bit, BB) to all ALUs of the local CU simultaneously.

[0059] The three inputs to the AND gate are connected respectively to the WRITE ENABLE control signal bus 343 and the outputs of the first and second registers 327, 329, and the output of the AND gate is used to control write operations from the CU processor elements, for example to memory. It is to be noted that write operations are not only controlled by the array controller, but also by the local CU, via the boundary circuit, and in this embodiment are conditional on the content/state of both first and second registers.
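The write gating described above amounts to a three-input logical AND, which may be sketched as follows. The function and parameter names are hypothetical and serve only to illustrate the condition.

```python
def local_write_enable(write_enable, first_register, second_register):
    """Model of the three-input AND gate of a boundary circuit.

    A write from the processor elements of the local CU is permitted
    only when the write enable signal from the array controller and
    both locally held register bits are set.
    """
    return bool(write_enable) and bool(first_register) and bool(second_register)


if __name__ == "__main__":
    # The array controller requests a write, but the local CU vetoes it
    # because its second register is cleared.
    print(local_write_enable(1, 1, 0))
```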

[0060] In this embodiment, additional logic 390 is provided for receiving the contents of the second register of each boundary circuit and performing a logical AND operation on the output of all boundary circuits. The output of this Global AND operation can be used to control further operations of the processor.

[0061] Similarly, in this embodiment, additional logic is provided for receiving the contents of the third register of each boundary circuit and performing a logical OR operation on the output of all boundary circuits. Again, the output of this Global OR operation can be used to control further operations of the processor. For example, the signal may be used to indicate that one of the CUs has reached a predetermined condition, and that further processing by the other CUs is not required. The output of the Global OR may then be used by the processor to terminate processing.
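
The Global AND and Global OR reductions can be sketched as follows. This is a behavioural illustration only: the `BoundaryCircuit` class, its field names and the two function names are hypothetical, and only the choice of register feeding each reduction is taken from the description.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BoundaryCircuit:
    """Hypothetical model holding only the two registers used here."""
    reg2: int  # second register, feeds the Global AND
    reg3: int  # third register, feeds the Global OR


def global_and(circuits: Sequence[BoundaryCircuit]) -> int:
    # 1 only if the second register of every boundary circuit is set.
    return int(all(c.reg2 & 1 for c in circuits))


def global_or(circuits: Sequence[BoundaryCircuit]) -> int:
    # 1 if the third register of at least one boundary circuit is set.
    return int(any(c.reg3 & 1 for c in circuits))


# Four CUs; the third one has reached its predetermined condition.
array = [BoundaryCircuit(1, 0), BoundaryCircuit(1, 0),
         BoundaryCircuit(1, 1), BoundaryCircuit(1, 0)]
stop_processing = global_or(array)   # one CU is done, so processing may stop
all_agree = global_and(array)        # every second register is set
```

A controller model could poll `stop_processing` after each step and end the computation as soon as it becomes 1, matching the early termination example given above.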

[0062] The extension circuit may be controlled by the array controller, and may be controlled dynamically in response to the length of the word to be processed, to extend or contract the number of CUs required to operate together.

[0063] Any number of CUs can be combined to extend the length (i.e. number of bits) of the word that can be processed. For example, a plurality of eight-bit computational units may be combined to form one or more 16-bit processors, one or more 32-bit processors, one or more 64-bit processors, or one or more 128-bit processors, etc. Each CU has at least one processor element, which may be a 1-bit processor element or a processor element of 2 or more bits, and may have any number of PEs. Different CUs may contain the same number of PEs, or different numbers of PEs. Thus, any number of CUs having any number of PEs may be combined to enable a word of a given length to be parallel processed.
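
As an illustration of how coupled CUs can act on one longer word, the sketch below models an addition in which the bit leaving the MSB ALU of one CU is fed to the LSB ALU of the next. Treating that bit as an arithmetic carry, the use of 1-bit PEs, and all function names are assumptions made for this example; the specification itself does not prescribe this operation.

```python
from typing import List, Sequence, Tuple


def to_bits(value: int, width: int) -> List[int]:
    """Split an integer into `width` bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: Sequence[int]) -> int:
    """Rebuild an integer from bits given LSB first."""
    return sum(bit << i for i, bit in enumerate(bits))


def cu_add(a: Sequence[int], b: Sequence[int], carry_in: int) -> Tuple[List[int], int]:
    """One CU of 1-bit PEs: each ALU passes its carry to the next higher ALU."""
    out, carry = [], carry_in
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out, carry  # carry is the bit leaving the MSB ALU


def composite_add(a: int, b: int, cu_widths: Sequence[int]) -> Tuple[int, int]:
    """Chain CUs (lowest-order first) so the word spans all of them."""
    total = sum(cu_widths)
    a_bits, b_bits = to_bits(a, total), to_bits(b, total)
    result, carry, pos = [], 0, 0
    for width in cu_widths:
        # The extension path hands the previous CU's MSB output to this CU.
        part, carry = cu_add(a_bits[pos:pos + width], b_bits[pos:pos + width], carry)
        result.extend(part)
        pos += width
    return from_bits(result), carry


# Two 8-bit CUs acting as one 16-bit unit.
sum16, carry16 = composite_add(0x00FF, 0x0001, [8, 8])
```

Because `cu_widths` is an arbitrary list, the same model also covers CUs with different numbers of PEs, for example `[4, 8, 4]` for a 16-bit word.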

[0064] FIG. 5 shows an array of computational units, in which two or more adjacent computational units may be combined into a composite computational unit through extension circuitry control signals (the extension circuitry is not shown in FIG. 5). In this embodiment, each CU is an 8-bit CU, and the CUs can be controlled to operate either individually as 8-bit parallel processors, or combined into 16-bit parallel processors or 32-bit parallel processors.

[0065] In operation, the operating mode is set by a control signal CUX_SEL[2:0]. When CUX_SEL[2:0] are all 0, the CUs will operate in 8-bit mode. When CUX_SEL[2:1] are 0 and CUX_SEL[0] is 1, the circuit will operate in 16-bit mode. When CUX_SEL[2] is 0 and CUX_SEL[1:0] are both 1, the circuit will operate in 32-bit mode. When CUX_SEL[2:0] are all 1, all the CUs will be operating together and the circuit will be in 256-bit mode. This embodiment is for illustrative purposes only, and other configurations are possible.
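
The mode selection described above can be summarised as a small lookup. This is a sketch under stated assumptions: the table and function names are hypothetical, CUX_SEL is modelled as a 3-bit integer with bit 0 as CUX_SEL[0], and the four encodings are the only ones the description defines, so every other value is rejected rather than guessed.

```python
# Word length in bits for each CUX_SEL[2:0] value named in the description.
CUX_SEL_MODES = {
    0b000: 8,    # all bits 0: CUs operate individually
    0b001: 16,   # CUX_SEL[0] = 1
    0b011: 32,   # CUX_SEL[1:0] = 1
    0b111: 256,  # all bits 1: all CUs operate together
}


def word_length(cux_sel: int) -> int:
    """Return the word length selected by a 3-bit CUX_SEL value."""
    if cux_sel not in CUX_SEL_MODES:
        # Encodings such as 0b010 are not described, so they are not modelled.
        raise ValueError(f"undefined CUX_SEL value: {cux_sel:#05b}")
    return CUX_SEL_MODES[cux_sel]


def cus_per_word(cux_sel: int, cu_width: int = 8) -> int:
    """Number of coupled CUs per word, assuming every CU is `cu_width` bits."""
    return word_length(cux_sel) // cu_width


pairs = cus_per_word(0b001)  # two 8-bit CUs form each 16-bit processor
```

With 8-bit CUs, the 256-bit mode corresponds to 32 coupled CUs in this model; the number of CUs in the array of FIG. 5 is not stated here, so that figure is an inference rather than a quoted value.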

[0066] Modifications and changes to the embodiments described above will be apparent to those skilled in the art.

1. A circuit comprising a processor having a plurality of processor elements, each having an arithmetic logic unit, and a controller for controlling said processor elements, means for providing a respective bit of a multiple bit word to each of said processor elements, and transmission means for enabling signals to be transmitted between said arithmetic logic units, to enable said units to perform a parallel operation on the bits of said multiple bit word.
2. A circuit as claimed in claim 1, wherein each arithmetic logic unit has an output and an input, and said transmission means includes means for coupling the output of each ALU directly to the input of its adjacent ALU which processes a higher order bit.
3. A circuit as claimed in claim 1, wherein said transmission means includes means for coupling the output of the ALU for processing the most significant bit of the multiple bit word directly to the input of the ALU for processing the least significant bit of the multiple bit word.
4. A circuit as claimed in claim 1, wherein the processor element for processing the MSB of the multiple bit word has an input, and the transmission means includes coupling means for coupling the output of the ALU for processing the LSB directly to the input of the MSB processing element.
5. A circuit as claimed in claim 1, further comprising a memory coupleable to each of said processor elements.
6. A die containing a circuit as claimed in claim 1.
7. A circuit comprising a plurality of computational units, each having at least one processor element and extension means for switchably enabling the transmission of signals between said computational units to enable said units to perform a parallel operation on a multiple bit word wherein at least one bit of said word is provided to each computational unit.
8. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the highest order bit of a computational unit to another computational unit that processes the next higher order bit.
9. A circuit as claimed in claim 8, wherein said extension means is capable of coupling said output to the input of the processor element that processes the next higher order bit of said other computational unit.
10. A circuit as claimed in claim 9, wherein said processor element that processes said highest order bit includes an arithmetic logic unit, and said output comprises the output of said arithmetic logic unit.
11. A circuit as claimed in claim 9, wherein said processor element that processes said higher order bit has an arithmetic logic unit, and said input comprises an input to said arithmetic logic unit.
12. A circuit as claimed in claim 7, wherein said extension means includes a selector switch for selectively coupling said input to one of said output and a port capable of receiving a bit from said other computational unit, or another computational unit.
13. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the lowest order bit of a computational unit to another computational unit that processes the next lower order bit.
14. A circuit as claimed in claim 13, wherein said extension means is capable of coupling said output to the input of the processor element that processes the next lower order bit of said other computational unit.
15. A circuit as claimed in claim 14, wherein said processor element that processes said lowest order bit includes an arithmetic logic unit, and said output comprises the output of said arithmetic logic unit.
16. A circuit as claimed in claim 15, wherein said extension means includes a selector switch for selectively coupling said input to one of said output and a port for receiving another bit either from said other computational unit that processes the lower order bit or from another computational unit.
17. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the most significant bit of the multi-bit word to at least one other computational unit that processes one or more bits of said multi-bit word.
18. A circuit as claimed in claim 17, wherein said extension means is capable of coupling the output of said processor element that processes the most significant bit of the multi-bit word to the computational unit that processes the least significant bit of the multi-bit word.
19. A circuit as claimed in claim 7, wherein said extension means includes a selector switch for selectively coupling one of the output of the processor element of a computational unit that processes the highest order bit of that computational unit and an output of the processor element that processes the highest order bit of another computational unit, the highest order bit of the other computational unit being higher than the highest order bit of the one computational unit, to an output port coupled to said one computational unit.
20. A circuit as claimed in claim 7, wherein said extension means is capable of coupling the output of the processor element that processes the least significant bit of the multi-bit word to at least one other computational unit that processes one or more bits of said multi-bit word.
21. A circuit as claimed in claim 20, wherein said extension means is capable of coupling the output of the processor element that processes the least significant bit of the multi-bit word to the computational unit that processes the most significant bit of the multiple bit word.
22. A circuit as claimed in claim 7, comprising a selector switch for selectively coupling one of the output of the processor element that processes the lowest order bit of a computational unit and the output of a processor element that processes the lowest order bit of another computational unit, wherein the lowest order bit of the other computational unit is lower than the one computational unit, to a port coupled to the one computational unit or to a computational unit that processes one or more higher order bits than the one computational unit.
23. A circuit as claimed in claim 7, comprising a first CU and a second CU, said first CU having a plurality of processor elements each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the MSB ALU of the first CU to be transmitted to the input of the LSB ALU of the second CU.
24. A circuit as claimed in claim 7, comprising a first CU and a second CU, said second CU having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the MSB ALU of the second CU to be transmitted to the input of the LSB ALU of the first CU.
25. A circuit as claimed in claim 7, comprising a first CU and a second CU, the first CU having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means is arranged to enable a bit output from the LSB ALU of the first CU to be transmitted to the second CU.
26. A circuit as claimed in claim 25, wherein said second CU comprises a plurality of processing elements, and said extension means is arranged to enable a bit from said LSB ALU of said first CU to be transmitted to the input of the MSB ALU of said second CU.
27. A circuit as claimed in claim 7, comprising a first CU and a second CU, said first and second CUs having a plurality of processor elements, each having an arithmetic logic unit arranged together for performing parallel operations on multiple bit data, and wherein said extension means enables the output of the MSB ALU of the second CU to be coupled to the input of the LSB ALU of the first CU.